[fix](ann-index) Fix ANN IVF/PQ recall, avoid init-time large ANN build-buffer reservation, and skip ANN index build for segments with insufficient rows. #64082
run buildall
a66b5d7 to 582071f
run buildall
TPC-H: Total hot run time: 29269 ms
TPC-DS: Total hot run time: 169468 ms
run buildall
### What problem does this PR solve?

Issue Number: None

Related PR: apache#64082

Problem Summary: Clarify why the ANN index writer swaps the buffered vectors with an empty PODArray instead of using `clear()`. The swap intentionally releases the full-segment training buffer before saving the index, while `clear()` would keep the allocated capacity.

### Release note

None

### Check List (For Author)

- Test: No need to test (comment-only change)
- Behavior changed: No
- Does this need documentation: No
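The difference between `clear()` and the swap idiom described in this commit message can be illustrated with a minimal sketch. It uses `std::vector<float>` as a stand-in for Doris's PODArray (an assumption made only for illustration), and the helper names `capacity_after_clear` and `capacity_after_swap` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the writer's vector buffer. Doris uses PODArray;
// std::vector is assumed here purely for illustration.
using FloatBuffer = std::vector<float>;

// clear() destroys the elements but keeps the allocation,
// so the memory stays reserved by the buffer.
inline std::size_t capacity_after_clear(FloatBuffer buf) {
    buf.clear();
    return buf.capacity();
}

// Swapping with an empty temporary hands the allocation to the temporary,
// which frees it when the temporary is destroyed at the end of the statement.
inline std::size_t capacity_after_swap(FloatBuffer buf) {
    FloatBuffer().swap(buf);
    return buf.capacity();
}
```

On common standard library implementations an empty, default-constructed vector owns no allocation, which is why the swapped buffer ends up holding no memory.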
### What problem does this PR solve?

Issue Number: None

Related PR: apache#64082

Problem Summary: Remove the redundant ANN writer `_skip_build` state. The flag was only set from `close_on_error()`, while normal index skip behavior is already driven by zero rows or by the segment row count being smaller than the index training requirement. Keeping the writer state explicit avoids carrying an abort flag into the regular add and finish paths.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - `ENABLE_PCH=OFF ./run-be-ut.sh --run --filter=AnnIndexWriterTest.*`
- Behavior changed: No
- Does this need documentation: No
TPC-H: Total hot run time: 29312 ms
TPC-DS: Total hot run time: 169349 ms
…added no-train indexes during segment writing. This made the build strategy harder to reason about and could still spend CPU/memory building small HNSW/FLAT segments that should be skipped by a Doris-side row threshold.

This change removes the chunk add configs, buffers ANN vectors for the whole segment, and applies `effective_min_rows = max(vector_index->get_min_train_rows(), config::ann_index_build_min_segment_rows)` in `finish()`. It then trains when needed, adds once, releases the build buffer, and saves the index. Empty segments or segments below the effective threshold delete only the current index entry instead of persisting an ANN index.

Add BE config `ann_index_build_min_segment_rows` to skip persisting ANN indexes for small segments. Remove `ann_index_build_add_chunk_size` and `ann_index_build_add_chunk_bytes`.
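The finish-time threshold described above can be sketched as follows. This is a simplified illustration, not the actual writer code: `BuildDecision` and `decide_build` are hypothetical names, and the two row limits are passed in as plain integers rather than read from the vector index and BE config.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical outcome of the finish-time check.
enum class BuildDecision { SkipAndDeleteEntry, BuildAndSave };

// min_train_rows stands in for vector_index->get_min_train_rows(),
// min_segment_rows for config::ann_index_build_min_segment_rows.
inline BuildDecision decide_build(int64_t buffered_rows, int64_t min_train_rows,
                                  int64_t min_segment_rows) {
    // The stricter of the two limits wins.
    const int64_t effective_min_rows = std::max(min_train_rows, min_segment_rows);
    // Empty segments and segments below the threshold persist no ANN index.
    if (buffered_rows == 0 || buffered_rows < effective_min_rows) {
        return BuildDecision::SkipAndDeleteEntry;
    }
    // Otherwise: train if needed, add once, release the buffer, save.
    return BuildDecision::BuildAndSave;
}
```

For example, with a training requirement of 256 rows and a configured minimum of 1000 rows, a 300-row segment is skipped even though it could be trained.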
run buildall
TPC-H: Total hot run time: 29312 ms
TPC-DS: Total hot run time: 169025 ms
}

Status AnnIndexColumnWriter::_append_vectors_to_buffer(const float* vectors, size_t num_rows) {
    DCHECK(vectors != nullptr);
There used to be a check validating the dim length here. Is it gone now?
It is still in `add_array_values()`.
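The split discussed in this thread (dimension validation in the caller, unchecked appending in the buffer helper) might look roughly like the sketch below. The function names, the offsets layout (one entry per row boundary), and the use of `std::vector` are all assumptions for illustration; they do not reproduce the real `add_array_values()` signature.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true when every row delimited by `offsets` has exactly `dim` elements.
// `offsets` is assumed to hold row boundaries: row i spans [offsets[i], offsets[i + 1]).
inline bool rows_match_dim(const std::vector<uint64_t>& offsets, std::size_t dim) {
    for (std::size_t i = 1; i < offsets.size(); ++i) {
        if (offsets[i] < offsets[i - 1] || offsets[i] - offsets[i - 1] != dim) {
            return false;
        }
    }
    return true;
}

// Validates first, then appends. The append step relies on the precondition
// that dimensions were already checked, mirroring the comment added in the PR.
inline bool validate_then_buffer(std::vector<float>& buffer, const float* values,
                                 const std::vector<uint64_t>& offsets, std::size_t dim) {
    if (!rows_match_dim(offsets, dim)) {
        return false;  // reject the batch, leave the buffer untouched
    }
    if (offsets.size() < 2) {
        return true;  // no rows to append
    }
    assert(values != nullptr);
    const std::size_t count = static_cast<std::size_t>(offsets.back() - offsets.front());
    buffer.insert(buffer.end(), values, values + count);
    return true;
}
```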
_dir = compound_dir.value();

_min_segment_rows = AnnIndexColumnWriter::min_segment_rows();
What is this line of code doing?
Minimum segment rows required to persist an ANN index.
    return Status::OK();
}

Status AnnIndexColumnWriter::_build_and_save(Int64 min_train_rows, Int64 effective_min_rows) {
Why does this function need the `min_train_rows` parameter?
### What problem does this PR solve?

Issue Number: None

Related PR: apache#64082

Problem Summary: The ANN writer buffers vectors through an internal helper after validating array dimensions in `add_array_values()`. Add a short comment to make the validation precondition explicit for the buffer helper path.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran `git diff --check`
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: None

Related PR: apache#64082

Problem Summary: The ANN writer used a tiny helper only to compute `max(min_train_rows, ann_index_build_min_segment_rows)`. Inline the single-use calculation in `finish()` to keep the build threshold logic local and reduce unnecessary indirection.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran `git diff --check`
    - Ran `rg` to verify `_effective_min_rows` has no remaining references
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: None

Related PR: apache#64082

Problem Summary: The ANN writer had small single-use helpers and a cached min segment rows member after switching to finish-time buffering. Inline vector buffering, buffer release, and direct `ann_index_build_min_segment_rows` access at their call sites to keep the writer implementation simpler.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran `git diff --check`
    - Ran `rg` to verify `_append_vectors_to_buffer`, `_release_buffered_vectors`, `_min_segment_rows`, and `min_segment_rows()` have no remaining references
- Behavior changed: No
- Does this need documentation: No
run buildall
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
TPC-H: Total hot run time: 29625 ms
TPC-DS: Total hot run time: 170617 ms
What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary:
This PR fixes several ANN index build issues:
- … `ann_index_build_chunk_size * dim` floats during init, which could allocate excessive memory immediately for high-dimensional vectors.
- … `nlist` as its minimum FAISS training row requirement.

This PR changes the build behavior as follows:
- … `ann_index_build_min_segment_rows` so small ANN indexes can be skipped by a Doris-side row threshold.

Release note
Fix ANN IVF/PQ recall, avoid init-time large ANN build-buffer reservation, and skip ANN index build for segments with insufficient training rows.
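To give a sense of why the init-time reservation named in the problem summary matters, the arithmetic can be sketched as below. The helper `reserved_bytes` is hypothetical, and the chunk size and dimension used in the example are illustrative values, not Doris defaults.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to hold chunk_rows vectors of `dim` float components each.
// Mirrors a reservation of chunk_rows * dim floats; values are illustrative.
inline uint64_t reserved_bytes(uint64_t chunk_rows, uint64_t dim) {
    return chunk_rows * dim * sizeof(float);
}
```

With an assumed chunk of 1,000,000 rows and 1024-dimensional vectors, `reserved_bytes(1000000, 1024)` comes to 4,096,000,000 bytes, roughly 4 GB reserved before a single row is written.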
Check List (For Author)
Test
./run-regression-test.sh --run -d ann_index_p0 -s ivf_pq_full_buffer_train_recall
run buildall
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)